Goto

Collaborating Authors

 deception attack


Compromising Honesty and Harmlessness in Language Models via Deception Attacks

arXiv.org Artificial Intelligence

Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. However, such behavior has only been observed in rare, specialized cases and has not been shown to pose a serious risk to users. Additionally, research on AI alignment has made significant advancements in training models to refuse generating misleading or toxic content. As a result, LLMs generally became honest and harmless. In this study, we introduce a novel attack that undermines both of these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. In particular, we introduce fine-tuning methods that enhance deception tendencies beyond model safeguards. These "deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others. Furthermore, we find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content. Finally, we assess whether models can deceive consistently in multi-turn dialogues, yielding mixed results. Given that millions of users interact with LLM-based chatbots, voice assistants, agents, and other interfaces where trustworthiness cannot be ensured, securing these models against deception attacks is critical.


Distributed Detection of Adversarial Attacks for Resilient Cooperation of Multi-Robot Systems with Intermittent Communication

arXiv.org Artificial Intelligence

This paper concerns the consensus and formation of a network of mobile autonomous agents in adversarial settings where a group of malicious (compromised) agents are subject to deception attacks. In addition, the communication network is arbitrarily time-varying and subject to intermittent connections, possibly imposed by denial-of-service (DoS) attacks. We provide explicit bounds for network connectivity in an integral sense, enabling the characterization of the system's resilience to specific classes of adversarial attacks. We also show that under the condition of connectivity in an integral sense uniformly in time, the system is finite-gain $\mathcal{L}_{p}$ stable and uniformly exponentially fast consensus and formation are achievable, provided malicious agents are detected and isolated from the network. We present a distributed and reconfigurable framework with theoretical guarantees for detecting malicious agents, allowing for the resilient cooperation of the remaining cooperative agents. Simulation studies are provided to illustrate the theoretical findings.


Fujitsu Strengthens Cyber-Security with AI Technology

#artificialintelligence

Fujitsu Laboratories Ltd. announced the development of a technology to make AI models more robust against deception attacks. The technology protects against attempts to use forged attack data to trick AI models into making a deliberate misjudgment when AI is used for sequential data consisting of multiple elements. With the use of AI technologies progressing in various fields in recent years, the risk of attacks that intentionally interfere with AI's ability to make correct judgments represents a source of growing concern. Many suitable conventional security resistance enhancement technologies exist for media data like images and sound. Their application to sequential data such as communication logs and service usage history remains insufficient, however, because of the challenges posed by preparing simulated attack data and the loss of accuracy.


DARPA snags Intel to lead its machine learning security tech – TechCrunch

#artificialintelligence

Chip maker Intel has been chosen to lead a new initiative led by the U.S. military's research wing, DARPA, aimed at improving cyber-defenses against deception attacks on machine learning models. Machine learning is a kind of artificial intelligence that allows systems to improve over time with new data and experiences. One of its most common use cases today is object recognition, such as taking a photo and describing what's in it. That can help those with impaired vision to know what's in a photo if they can't see it, for example, but it also can be used by other computers, such as autonomous vehicles, to identify what's on the road. But deception attacks, although rare, can meddle with machine learning algorithms.